Failed Runs¶
Dealing with Failures¶
In a large dataset, it's not impossible that some runners will fail to run due to unforeseen circumstances. Failures can occur at any point: in the shell, the scheduler, or python, for example. Runners will still be marked as “satisfied” if that is the case, but a summary of the error message will be available in ds.errors.
However, it is not enough to know that your calculations have failed, so let's explore some tools that help you figure out why.
We’ll cover some common failure modes, so you can get a sense of what they look like:
Function Error
Argument Error
Submission Error
Walltime Error
Command Error
Function Error¶
There is a disconnect (however small) between writing and running the function. This can lead to small issues that ultimately cause the job to fail.
For this example, we’ll simulate a broken function by attempting to access a variable that doesn’t exist:
[2]:
import time
from remotemanager import Dataset
def multiply(a, b):
    foo  # deliberately reference an undefined name to trigger a NameError
    return a * b
ds = Dataset(multiply, skip=False)
ds.append_run({'a': 2, 'b': 2})
ds.run()
ds.wait(1, 10)
ds.fetch_results()
appended run runner-0
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 5 Files... Done
Remotely executing 1/1 Runners
Fetching results
Transferring 1 File... Done
Run complete, let's see what happened:
[3]:
ds.results
Warning! Found 1 error(s), also check the `errors` property!
[3]:
[RunnerFailedError('NameError: name 'foo' is not defined')]
No results, and a warning saying that there is something in the errors property. Let's check it.
[4]:
ds.errors
[4]:
["NameError: name 'foo' is not defined"]
Here’s the error we were expecting.
Key indicators of failure are:
An unexpected None result
Content in the errors property
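As a minimal hedged sketch (assuming the results and errors lists share the same ordering as ds.runners, as they do in the outputs in this tutorial), you can scan for these indicators directly:

# Flag runners whose result is missing or which reported an error.
for runner, result, error in zip(ds.runners, ds.results, ds.errors):
    if result is None or error is not None:
        print(f"{runner} may have failed: {error}")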
Note
It is possible to have a populated error file, but a successful run (some schedulers put warnings in stderr). This is why this message is only a warning. We will see this later in this tutorial.
Function Fixes¶
Since the identity of the dataset is tied heavily to the function, the only option for fixing the function is to create a new dataset.
If you have already submitted runs that you don't want to resubmit, however, you can copy them across to your new dataset, preserving their status. This is best done with ds_new.copy_runners(ds).
Let's fix this function and rerun:
[5]:
def multiply(a, b):
    return a * b
ds_fixed = Dataset(multiply, skip=False)
ds_fixed.copy_runners(ds)
ds_fixed.runners
[5]:
[dataset-6fd64b82-runner-0]
Now we have our runner in our new dataset. This works because while the Dataset handles the function, a Runner only cares about the arguments. So as long as the function signatures match, copying across allows you to preserve your work.
Note
You can also select runners to insert, using ds.insert_runner(runner). Internally, copy_runners uses this function by looping over the runners property of the given dataset.
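For instance, here is a hedged sketch of selectively copying only the failed runners (using the is_failed flag described later in this tutorial) rather than everything:

# Copy only the runners that failed into the new, fixed dataset.
for runner in ds.runners:
    if runner.is_failed:
        ds_fixed.insert_runner(runner)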
Note that since the runners are copied across unchanged, they retain their run state. So if we want to rerun, we must force it:
[6]:
ds_fixed.run()
Staging Dataset... No Runners staged
No Transfer required
[6]:
False
[7]:
ds_fixed.run(force=True)
ds_fixed.wait(1, 10)
ds_fixed.fetch_results()
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 5 Files... Done
Remotely executing 1/1 Runners
Fetching results
Transferring 2 Files... Done
[8]:
ds_fixed.results
[8]:
[4]
Argument Error¶
When generating runs, sometimes the arguments themselves can be at fault. We can demonstrate this simply by adding a runner for the multiply function that has None as one of the args.
[10]:
ds = Dataset(multiply, skip=False)
ds.append_run({"a": 10, "b": 5})
ds.append_run({"a": 7, "b": None})
ds.run()
ds.wait(1, 10)
ds.fetch_results()
ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 3 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[10]:
[50,
RunnerFailedError('TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'')]
[11]:
ds.errors
[11]:
[None, "TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'"]
As expected, the 2nd runner failed.
Argument Fixes¶
Since the args themselves are at fault, and runners are responsible for holding the args, we should remove and replace this runner.
For this purpose, Dataset has a remove_run function:
[12]:
ds.remove_run({"a": 7, "b": None})
removed runner dataset-6fd64b82-runner-1
[12]:
True
For more information on removing runners (and other bad data), see the Dataset Cleaning Tutorial.
Submission Error - python¶
Supercomputers are often very specific about their environments and software. It's very easy to specify an incorrect module, python version, or submitter. This is often solved within the URL; however, the issue can also arise from the extra in the dataset or runner. In any case, simply updating the incorrect line and resubmitting is often enough to resolve the issue.
Let's set python to something that doesn't exist to simulate this:
[14]:
from remotemanager import URL
url = URL(python="foo")
ds = Dataset(multiply, url=url, skip=False)
ds.append_run({"a": 10, "b": 5}, extra="bar")
ds.append_run({"a": 7, "b": 15})
ds.run()
ds.wait(1, 10)
ds.fetch_results()
ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 2 Files... Done
Warning! Found 2 error(s), also check the `errors` property!
[14]:
[RunnerFailedError('dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found'),
RunnerFailedError('dataset-6fd64b82-runner-1-jobscript.sh: line 5: foo: command not found')]
[15]:
ds.errors
[15]:
['dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found',
'dataset-6fd64b82-runner-1-jobscript.sh: line 5: foo: command not found']
Error Investigation¶
Here we know that python was set to an incorrect value, but this is not always the case, so the error may need more investigation.
First off, the errors property only shows us the last line of the error. While this can be enough, let's see if there's more to this particular error.
The ds.failed property returns a list of all runners that report is_failed=True. Runners also have a full_error property, which will return the full contents of the error file for you:
[16]:
print(ds.failed[0].full_error)
dataset-6fd64b82-runner-0-jobscript.sh: line 3: bar: command not found
dataset-6fd64b82-runner-0-jobscript.sh: line 6: foo: command not found
Here we can see the extra “bar” string that we set in the runner's extra, but no further information about our “foo” error.
Let's fix that by setting python to something sensible. Let's also leave the bar untouched for now, to see what happens:
[18]:
ds.url.python = "python3"
ds.run(force=True)
ds.wait(1, 10)
ds.fetch_results()
ds.results
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 4 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[18]:
[50, 105]
Since we didn't update the extra="bar" line, we still have an error there! This is important: just because there is an error does not necessarily mean that the run has failed.
[19]:
ds.errors
[19]:
['dataset-6fd64b82-runner-0-jobscript.sh: line 3: bar: command not found',
None]
Extra Fixes¶
However, this is easily fixed: simply setting the extra back to None will remove this error:
[21]:
ds.get_runner(0).extra = None
ds.run(force=True, force_ignores_success=True)
ds.wait(1, 10)
ds.fetch_results()
ds.results
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 4 Files... Done
[21]:
[50, 105]
Submission Errors - shell¶
Submission of a run requires more than a simple python command; there are in fact two more similar arguments: submitter, which is put into the master script, and shell, which is used to launch the master script.
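As a hedged sketch (the values below are placeholders, and submitter is assumed to be accepted alongside python and shell as described above), a URL for a scheduler-based machine might configure all three launch-related arguments:

from remotemanager import URL

# Placeholder values; use whatever your machine actually requires.
url = URL(
    python="python3",    # interpreter that runs the function
    submitter="sbatch",  # command placed in the master script (assumed scheduler submit command)
    shell="bash",        # shell used to launch the master script
)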
If you suspect that your shell might be broken, there is a very simple way to see what was submitted:
[23]:
url = URL(shell="foo")
ds = Dataset(multiply, url=url, skip=False)
ds.append_run({"a": 10, "b": 5}, extra="bar")
ds.append_run({"a": 7, "b": 15})
ds.run()
ds.wait(1, 5)
ds.fetch_results()
ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[23], line 9
5 ds.append_run({"a": 7, "b": 15})
7 ds.run()
----> 9 ds.wait(1, 5)
11 ds.fetch_results()
13 ds.results
File ~/remotemanager/remotemanager/dataset/dataset.py:2388, in Dataset.wait(self, interval, timeout, watch, success_only, only_runner, force)
2386 t0 = int(time.time())
2387 # check all non None states
-> 2388 while not wait_condition():
2389 dt = int(time.time()) - t0
2391 if watch:
File ~/remotemanager/remotemanager/dataset/dataset.py:2361, in Dataset.wait.<locals>.wait_condition()
2360 def wait_condition():
-> 2361 states = self._is_finished(force=force)
2363 if only_runner is not None:
2364 return only_runner.is_finished
File ~/remotemanager/remotemanager/dataset/dataset.py:2305, in Dataset._is_finished(self, check_dependency, dependency_call, force)
2303 warnings.warn(msg)
2304 else:
-> 2305 raise RuntimeError(msg)
2307 if check_dependency and not dependency_call and self.dependency is not None:
2308 self.dependency.check_failure()
RuntimeError: Dataset encountered an issue:
/bin/bash: foo: command not found
Our wait did not complete, which means no output files were produced: a surefire indicator of an error. If not even an error file was produced, it's very likely that the calculations were never submitted, which is typically caused by a broken launch command. We can check this with the run_cmd attribute:
[24]:
ds.run_cmd.sent
[24]:
'cd temp_runner_remote && foo dataset-6fd64b82-master.sh'
There's a lot going on with this command, but all you really need to see here is the final section, where we can see our foo. This can be changed back to bash (or your preferred shell) via url.shell.
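As a minimal hedged sketch mirroring the python fix earlier (assuming bash is the shell you actually want), the recovery would look something like this:

# Restore a working shell, then force a resubmission.
ds.url.shell = "bash"
ds.run(force=True)
ds.wait(1, 10)
ds.fetch_results()
ds.results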
Walltime Errors¶
Even if you make no mistakes on your end, it's still possible for a run to time out, run out of memory, or hit any other scheduler-related issue. The fixes in this case are similar to the previous example: bump up the walltime request (or other resources) if needed and resubmit.
To demonstrate this, we'll insert a string into the jobscript that simulates a walltime issue, and also “hide” some “scheduler info” above it.
[26]:
fake_walltime = '''
echo "{scheduler info}" >&2
echo out of walltime! >&2
exit 1'''
ds = Dataset(multiply, skip=False)
ds.append_run({"a": 10, "b": 5}, extra=fake_walltime)
ds.append_run({"a": 7, "b": 15})
ds.run()
ds.wait(1, 10)
ds.fetch_results()
ds.results
appended run runner-0
appended run runner-1
Staging Dataset... Staged 2/2 Runners
Transferring for 2/2 Runners
Transferring 7 Files... Done
Remotely executing 2/2 Runners
Fetching results
Transferring 3 Files... Done
Warning! Found 1 error(s), also check the `errors` property!
[26]:
[RunnerFailedError('out of walltime!'), 105]
[27]:
ds.errors
[27]:
['out of walltime!', None]
There's our walltime line; perhaps the scheduler had more info for us?
[28]:
print(ds.failed[0].full_error)
{scheduler info}
out of walltime!
It seems it did. In a real case, this content might give useful advice for fixing your jobs (resource limits, etc.).
Let's remove the walltime issue and resubmit. Here, we're just removing the extra, but in your case it may be on the URL side of things.
Since only one job actually failed, we really only want to rerun that one. You can use the ds.failed property to do this for you:
[29]:
for runner in ds.failed:
    runner.extra = None
    runner.run(force=True)
Staging Dataset... Staged 1/2 Runners
Transferring for 1/2 Runners
Transferring 5 Files... Done
Remotely executing 1/2 Runners
[30]:
ds.wait(1, 10)
ds.fetch_results()
ds.results
Fetching results
Transferring 2 Files... Done
[30]:
[50, 105]
Command Errors¶
In the background, Dataset is using the provided URL to issue commands on the remote machine. Sometimes, these commands can be the source of the failure.
The URL Tutorial has a section on error handling, but let's cover how to access these tools from the Dataset.
Each Dataset will have a url property, even if not set (one pointed at localhost will be created for you). This can be accessed at any time to change things or check for issues.
[31]:
ds.url.host
[31]:
'localhost'
Arguably the most useful debugging tool is the cmd_history property. This allows you to check the commands sent, up to cmd_history_depth (which defaults to 10).
We can write some quick debugging code to go through the history and find a specific command. Let's say we think there was a problem with rsync; all we need to do is iterate back through the history and see what's there:
[32]:
transfer = None
for cmd in reversed(ds.url.cmd_history):
    if "rsync" in cmd.sent:
        transfer = cmd
        break

print(transfer.sent)
rsync -auvh --checksum temp_runner_remote/{dataset-6fd64b82-runner-0-error.out,dataset-6fd64b82-runner-0-result.json} /home/test/remotemanager/docs/source/tutorials/temp_runner_local/
This command was used to retrieve the output files from the remote. You can also see what was returned by the command execution:
[34]:
print(transfer.stdout)
sending incremental file list
dataset-6fd64b82-runner-0-error.out
sent 187 bytes received 38 bytes 450.00 bytes/sec
total size is 2 speedup is 0.01
And any stderr:
[35]:
print(transfer.stderr)
In this case we have none, but in the theoretical situation where the rsync has thrown errors, they will be printed in full here.
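As a hedged sketch using the same command attributes (sent, stdout, stderr), you could also scan the whole history for anything that wrote to stderr:

# Print any recent command that produced stderr output.
for cmd in ds.url.cmd_history:
    if cmd.stderr:
        print("command:", cmd.sent)
        print("stderr:", cmd.stderr)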
Combined Debugging¶
In many cases, your problem will require a mix of these tools and solutions. But with experience, hopefully you will find the data flow easy to follow. Some points to remember:
An error in the output does not necessarily mean a failed run; it could just be a warning.
Use the failed property in combination with the other runner-based tools to save having to search out the runners yourself.
Runner.full_error is invaluable in finding hidden parts of your errors.
Sometimes the url is at fault; check your cmd_history!
Failing this, ds.run_cmd will let you see if your run was ever launched in the first place.
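To close, here is a hedged sketch combining these checks, using only the Dataset, Runner and URL properties shown in this tutorial:

# A combined debugging pass over a finished Dataset `ds`.
# 1. Quick overview of any reported errors
print(ds.errors)

# 2. Dig into each failed runner's full error file
for runner in ds.failed:
    print(runner, "failed with:")
    print(runner.full_error)

# 3. Check that the run was ever launched at all
print("launch command:", ds.run_cmd.sent)

# 4. Look for any remote command that produced stderr
for cmd in ds.url.cmd_history:
    if cmd.stderr:
        print(cmd.sent, "->", cmd.stderr)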